Non-blocking PUT in CHPL_COMM=ofi #25977

jhh67 · 2024-09-23T23:13:28Z

This PR implements non-blocking PUTs in CHPL_COMM=ofi correctly. Prior to Chapel 2.2, non-blocking PUTs were implemented via blocking PUTs, and blocking PUTs would be typically be "injected". The OFI injection functionality allows the source buffer to be reused immediately, and suppresses the completion event that normally indicates that the operation is complete. As a result, both blocking and non-blocking PUTs would incorrectly be considered complete when the function returned. Because injection allows the source buffer to be reused immediately this would not cause a race on the source buffer, but it would allow PUTs to be reordered and could potentially cause hangs because the upper layers would not know that there were outstanding PUTs that might require progressing the communication endpoint. Although it doesn't appear that we saw these problems in production use, the potential existed.

This PR flips the situation around so that a blocking PUT is implemented as a non-blocking PUT followed by waiting for the PUT to complete. The lower-level communication primitives are now all non-blocking. PUTs may still be injected, but I will likely remove that functionality in a subsequent PR because 1) it doesn't make sense to inject a blocking PUT because you are going to wait for it to complete anyway, and 2) the semantics of non-blocking PUTs in Chapel are that the caller may not reuse the source buffer until the PUT completes. As a result, neither benefit from injection.

As part of implementing this PR I changed the dangling PUT logic. A "dangling PUT" one that may not be visible, and will need to be forced into visibility when required by the Chapel memory consistency model. Previously, PUTs were not allowed to dangle if transmit contexts were unbound, which led to a lot of special-case code to deal with bound and unbound transmit contexts. Now all dangling PUTs are treated the same whether or not the transmit contexts are bound, and dangling PUTs are forced into visibility when an unbound transmit context is freed. They could possibly be deferred until even later, but we expect unbound transmit contexts to be the exception so it doesn't make sense to optimize them further. This change simplifies the code without degrading performance.

jhh67 · 2024-10-02T20:09:29Z

The following is the performance of the ra-rmo benchmark under different Chapel releases. The 2.1 release is the first with FI_ORDER_RMA_WAW erroneously enabled. The PR column is this pull request that has non-blocking PUTs implemented correctly. Prior to 2.1 non-blocking PUTs were incorrectly implemented using OFI's "inject" functionality, which provided to higher performance but had the potential to hang.

Nodes	1.32	2.0	2.1	main	PR
1	0.0322198	0.0318668	0.00366552	0.0224351	0.0314711
2	0.044806	0.0460495	0.00454622	0.0306014	0.0434377
4	0.0730733	0.0737625	0.00714498	0.0502822	0.0696524
8	0.136939	0.13859	0.0143102	0.0940088	0.13141
16	0.263465	0.263353	0.027159	0.180891	0.251367
32	0.490895	0.489333	0.0547763	0.340288	0.467814
64	0.956173	0.935306	0.0992574	0.666866	0.910218

Notes:

chpl --fast test/release/examples/benchmarks/hpcc/ra.chpl -sverify=false -suseOn=false -sN_U="2**(n-4)"
./ra-rmo --n=34 -nl ${i}x2
CHPL_RT_COMM_OFI_INJECT_RMA=true
Measured on hotlum

test/runtime/configMatters/comm/large-rma/EXECENV

test/runtime/configMatters/comm/large-rma/README

test/runtime/configMatters/comm/unbound/EXECENV

test/runtime/configMatters/comm/unbound/README

test/runtime/configMatters/comm/unbound/SKIPIF

runtime/src/comm/ofi/comm-ofi.c

Previously, non-blocking PUTs were implemented via blocking PUTs, which could severely limit performance. Prior to 2.0, small PUTs invoked fi_inject_write, which essentially turned them into non-blocking PUTs, but chpl_comm_put returned as if the PUT was completed. This could cause MCM violations as well as hangs caused by not progressing the network stack properly. These deficiences were fixed in 2.0, but led to a performance regression. This commit implements non-blocking PUTs correctly, so that the chpl_comm_*nb* functions work correctly. This should restore 1.32.0 performance while avoiding MCM violations and hangs. Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Rewrote PUT logic so that low-level functions are non-blocking, and a blocking PUT is implemented by initiating a non-blocking PUT and waiting for it to complete. This simplifies the implementation and avoids code duplication. Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Allow specifying the maximum message size and maximum number of endpoings. These are intended primarily for testing. Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Also some code cleanup. Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

We are now using this function to force visibility when an unbound endpoint is released, so it needs to work on unbound endpoints. Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Operations to force visibility are deferred until the endpoint is released, which requires the visibility bitmaps. Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Fixed how the number of transmit contexts needed is computed, and added some comments. Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Change type of numTxCtxs and numRxCtxs to size_t to match type of info->domain_attr->ep_cnt. Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

bradcray · 2024-10-09T21:38:24Z

@jhh67 : In your table above, would Chapel 2.2 be expected to essentially match the main column?

jhh67 · 2024-10-10T13:31:22Z

@bradcray: Yes.

jhh67 force-pushed the nbput branch 2 times, most recently from ee4aa61 to 495b202 Compare September 24, 2024 20:03

jhh67 mentioned this pull request Sep 25, 2024

Add missing tciFree #25974

Merged

jhh67 force-pushed the nbput branch from 495b202 to 932b8c8 Compare September 27, 2024 14:03

jhh67 requested a review from jabraham17 October 2, 2024 20:45

jhh67 marked this pull request as ready for review October 2, 2024 20:45

jabraham17 approved these changes Oct 8, 2024

View reviewed changes

jhh67 added 22 commits October 9, 2024 11:13

Added comments

4ba2770

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Free non-blocking handle after operation completes

132677d

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Cleanup

11eccc1

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Add environment variables for testing

08861b5

Allow specifying the maximum message size and maximum number of endpoings. These are intended primarily for testing. Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Free dynamically-allocated handles in ofi_put

6baa0e0

Also some code cleanup. Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Change forceMemFxVisAllNodes to work on unbound endpoints

2e0c389

We are now using this function to force visibility when an unbound endpoint is released, so it needs to work on unbound endpoints. Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Improved tci debugging output

47e75cb

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Allocate visibility bitmaps for unbound endpoints

9f4079e

Operations to force visibility are deferred until the endpoint is released, which requires the visibility bitmaps. Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Fixed number of transmit contexts computation

06e11af

Fixed how the number of transmit contexts needed is computed, and added some comments. Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Added tciAlloc call site debug info

1872b30

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Change type of numTxCtxs and numRxCtxs to size_t

b07b36d

Change type of numTxCtxs and numRxCtxs to size_t to match type of info->domain_attr->ep_cnt. Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

numTxCtxs is now of type size_t

27f9312

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Better comments

e3bc955

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Remove trailing whitespace

80d1bc1

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Run bigTransfer test with unbound endpoints

b58f4e0

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Only run PUT

a174b15

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Run bigTransfer tests with small fabric message size

c06b0b4

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Add chpl_comm_free_nb_handle to CHPL_COMM=none

c4ff0a1

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Add chpl_comm_free_nb_handle to CHPL_COMM=gasnet

95a418e

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Added chpl_comm_free_nb_handle to CHPL_COMM=ugni

1efd4f4

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

jhh67 added 3 commits October 9, 2024 11:13

Added chpl_comm_free_nb_handle to gasnet-ex

9bbb972

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Fixed typos

2dfb923

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Addressed reviewer's comments

b260c4f

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

jhh67 force-pushed the nbput branch from c210260 to b260c4f Compare October 9, 2024 18:13

jhh67 merged commit b3b7e55 into chapel-lang:main Oct 9, 2024
7 checks passed

jhh67 deleted the nbput branch October 9, 2024 19:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Non-blocking PUT in CHPL_COMM=ofi #25977

Non-blocking PUT in CHPL_COMM=ofi #25977

jhh67 commented Sep 23, 2024 •

edited

Loading

jhh67 commented Oct 2, 2024

bradcray commented Oct 9, 2024

jhh67 commented Oct 10, 2024

Non-blocking PUT in CHPL_COMM=ofi #25977

Non-blocking PUT in CHPL_COMM=ofi #25977

Conversation

jhh67 commented Sep 23, 2024 • edited Loading

jhh67 commented Oct 2, 2024

bradcray commented Oct 9, 2024

jhh67 commented Oct 10, 2024

jhh67 commented Sep 23, 2024 •

edited

Loading